contract + ontology: complete the classid → ClassView resolution (UNICHAR adapter, keystone, OGAR→ontology wiring) #534
Records the merged #521 (lance-graph-contract C++ codegen target MethodSig + UniCharSet content store) per the Mandatory Board-Hygiene Rule's post-merge step.
PR_ARC_INVENTORY prepend (Added/Locked/Deferred/Docs/Confidence) + LATEST_STATE narrative entry + "Recently Shipped PRs" table row.
Captures the PROBE-OGAR-ADAPTER-UNICHARSET FINDING: the full transcode pipeline (ruff ruff_cpp_spo harvest -> reassemble -> ruff_cpp_codegen -> these contract types) produces a UniCharSet byte-identical 112/112 to the libtesseract oracle on real eng data, proving the core-first transcode doctrine end-to-end.
Pairs with ruff #20. Merge commit 620bd8e.
Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
Transcode of Tesseract's `ccutil/unichar.cpp` (the UTF-8 layer UNICHARSET sits on top of) as a pure-Rust, zero-leptonica adapter: the second leaf through the harvest -> reassemble -> codegen pipeline after UniCharSet, proving PROBE-OGAR-ADAPTER-UNICHARSET generalizes beyond one class.
New `unichar` module:
- `utf8_step(lead) -> u8`: const-fn transcription of the 256-entry lead-byte table (unichar.cpp:143). 1/2/3/4 for legal leads, 0 for continuation bytes and 0xF8..=0xFF.
- `utf8_to_utf32(bytes) -> Option<Vec<i32>>`: mirrors UNICHAR::UTF8ToUTF32 (unichar.cpp:220); lead-byte validation only, None on illegal lead, the offset-decode of first_uni (unichar.cpp:105) inlined.
Byte-parity: `examples/unichar_dump.rs` vs a libtesseract UNICHAR oracle is 268/268 identical: all 256 utf8_step values (EXHAUSTIVE) + 12 utf8_to_utf32 corpus rows.
Why a faithful transcode and not core::str: Tesseract maps 0xC0/0xC1 to step 2 and decodes the overlong NUL `C0 80` to [0]; core::str::from_utf8 rejects both. A native-UTF-8 shortcut would silently diverge from the oracle. The `from_utf8_rejects_what_tesseract_accepts` test pins the gap.
Additive, zero-dep, pure text. +8 tests; 653 contract lib green; clippy --all-targets -D warnings + fmt clean.
Board: LATEST_STATE D-UNICHAR-1 + EPIPHANIES E-CPP-PARITY-2.
Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
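The lead-byte rule and the offset-decode described above can be sketched as follows. This is a minimal illustration, not the crate's code: the names `utf8_step_sketch` and `first_uni_sketch` are hypothetical, a `match` stands in for the 256-entry table, and the offsets are derived here as the sum of the UTF-8 marker bits for each length (an assumption about how the offset-decode works, consistent with `C0 80` decoding to 0).

```rust
/// Sketch of the lead-byte step rule: how many bytes the sequence starting
/// at `lead` claims to span, or 0 if `lead` cannot start a sequence.
/// (Hypothetical name; a match stands in for the 256-entry table.)
const fn utf8_step_sketch(lead: u8) -> u8 {
    match lead {
        0x00..=0x7F => 1,
        0x80..=0xBF => 0, // continuation byte, never a lead
        0xC0..=0xDF => 2, // includes 0xC0/0xC1, which strict UTF-8 forbids
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        0xF8..=0xFF => 0,
    }
}

/// Offset-decode of one character. Assumes `bytes` is non-empty and holds
/// the whole sequence announced by its lead byte; it panics otherwise.
fn first_uni_sketch(bytes: &[u8]) -> i32 {
    // Sum of the marker bits for each length, e.g. for two bytes:
    // (0xC0 << 6) + 0x80 = 0x3080.
    const OFFSETS: [i32; 5] = [0, 0, 0x3080, 0xE2080, 0x3C8_2080];
    let len = utf8_step_sketch(bytes[0]) as usize;
    let mut uni: i32 = 0;
    for &b in &bytes[..len] {
        // Accumulate raw bytes six bits at a time; no continuation check.
        uni = (uni << 6) + i32::from(b);
    }
    uni - OFFSETS[len]
}
```

Because continuation bytes are never validated, `first_uni_sketch(&[0xC0, 0x80])` yields 0, whereas `core::str::from_utf8(&[0xC0, 0x80])` returns an error. That is the gap the commit message says the `from_utf8_rejects_what_tesseract_accepts` test pins.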
…dapter
Wires the proven UniCharSet adapter (E-CPP-PARITY-1, byte-parity 112/112)
through the OGAR Core's three movable parts — steps 2-3 of
PROBE-OGAR-ADAPTER-UNICHARSET, which prior work left as "mechanical wiring,
conjectured." This proves the core-first transcode doctrine END-TO-END for the
unicharset class, not just for one leaf's bytes.
New `unicharset_adapter` module:
- `UniCharSetStore` trait: the classid-keyed content-store tier (consumer-
provided, dependency-inverted like ClassView). The adapter holds NO state;
the variable-length bijection rides this tier (I-VSA-IDENTITIES).
- `UniCharCall` (DO-in) / `UniCharOut` (DO-out, zero-copy borrow) / `DispatchError`.
- `invoke_unicharset(registry, store, classid, call)` — the keystone:
1. ClassView composition gate: codegen_manifest::methods_for(registry,
classid) must list the method (the harvested has_function manifest),
else MethodNotComposed (zero-fallback: unconfigured classid composes
nothing).
2. content-store tier: UniCharSetStore::unicharset(classid).
3. adapter leaf: UniCharSet::{id_to_unichar, unichar_to_id}.
Byte-parity is inherited from UniCharSet; the keystone proves the dispatch path
is faithful (the NULL->space edge survives it), the gate works, and there is NO
Core gap (the doctrine's iron guard holds with zero strain). Not routed through
the heavy OrchestrationBridge cross-subsystem router; this is the adapter-
invocation primitive a UnifiedStep calls.
Additive, zero-dep. +5 tests; clippy --all-targets -D warnings + fmt clean.
Board: LATEST_STATE D-UNICHARSET-KEYSTONE; EPIPHANIES E-CPP-KEYSTONE-1;
core-first-transcode-doctrine.md steps 2-3 marked wired.
Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
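The three-step dispatch described in this commit (composition gate, content-store tier, adapter leaf) might look roughly like the sketch below. Everything here is a simplified stand-in under stated assumptions: `Registry`, `MapStore`, the `u16` `ClassId`, the `StoreMiss` and `NotFound` variants, and the field layout of `UniCharSet` are hypothetical; only the names `invoke_unicharset`, `UniCharSetStore`, `UniCharCall`, `UniCharOut`, `DispatchError`, `MethodNotComposed`, and `methods_for` come from the commit message, and their real signatures may differ.

```rust
use std::collections::HashMap;

type ClassId = u16; // assumption: the real classid type is not shown

#[derive(Debug, PartialEq)]
enum DispatchError {
    MethodNotComposed,
    StoreMiss, // hypothetical: the store has no content for this classid
    NotFound,  // hypothetical: the leaf lookup found nothing
}

enum UniCharCall<'a> {
    IdToUnichar(u32),
    UnicharToId(&'a str),
}

#[derive(Debug, PartialEq)]
enum UniCharOut<'a> {
    Unichar(&'a str), // borrowed from the store, no copy
    Id(u32),
}

/// Stand-in leaf: a plain vector indexed by id.
struct UniCharSet {
    unichars: Vec<String>,
}

impl UniCharSet {
    fn id_to_unichar(&self, id: u32) -> Option<&str> {
        self.unichars.get(id as usize).map(String::as_str)
    }
    fn unichar_to_id(&self, s: &str) -> Option<u32> {
        self.unichars.iter().position(|u| u == s).map(|p| p as u32)
    }
}

/// Consumer-provided content store, keyed by classid.
trait UniCharSetStore {
    fn unicharset(&self, classid: ClassId) -> Option<&UniCharSet>;
}

/// Hypothetical in-memory store used only for illustration.
struct MapStore(HashMap<ClassId, UniCharSet>);

impl UniCharSetStore for MapStore {
    fn unicharset(&self, classid: ClassId) -> Option<&UniCharSet> {
        self.0.get(&classid)
    }
}

/// Hypothetical registry: classid -> names of the methods it composes.
struct Registry {
    methods: HashMap<ClassId, Vec<&'static str>>,
}

fn methods_for(registry: &Registry, classid: ClassId) -> &[&'static str] {
    // Zero-fallback: an unconfigured classid composes nothing.
    registry.methods.get(&classid).map(Vec::as_slice).unwrap_or(&[])
}

fn invoke_unicharset<'a, S: UniCharSetStore>(
    registry: &Registry,
    store: &'a S,
    classid: ClassId,
    call: UniCharCall<'_>,
) -> Result<UniCharOut<'a>, DispatchError> {
    // 1. Composition gate: the manifest must list the method.
    let method = match call {
        UniCharCall::IdToUnichar(_) => "id_to_unichar",
        UniCharCall::UnicharToId(_) => "unichar_to_id",
    };
    if !methods_for(registry, classid).contains(&method) {
        return Err(DispatchError::MethodNotComposed);
    }
    // 2. Content-store tier: the adapter itself holds no state.
    let set = store.unicharset(classid).ok_or(DispatchError::StoreMiss)?;
    // 3. Adapter leaf.
    match call {
        UniCharCall::IdToUnichar(id) => set
            .id_to_unichar(id)
            .map(UniCharOut::Unichar)
            .ok_or(DispatchError::NotFound),
        UniCharCall::UnicharToId(s) => set
            .unichar_to_id(s)
            .map(UniCharOut::Id)
            .ok_or(DispatchError::NotFound),
    }
}
```

In this sketch the gate runs before the store is consulted, so a classid with no manifest entry fails with `MethodNotComposed` even if the store happens to hold content for it.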
Closes the OGAR -> lance-graph-ontology gap an audit this session found: NiblePath::from_guid_prefix (the canon GUID->NiblePath fold) and the registry's entity_type <-> NiblePath bijection were BOTH built with ZERO callers -- the two halves of the bridge forged but never chained.
OntologyRegistry::class_id_for_guid(&NodeGuid) -> Option<ClassId> lays the join: from_guid_prefix(guid)? -> entity_type_of(path). A node row carrying a classid now resolves its ontology class (entity_type / ClassId), which RegistryClassView already turns into the class shape (fields/labels/template/DOLCE). No new predicate, no new type -- a method composing two existing surfaces (aligns with E-ODOO-CORE-FIRST-STRUCTURAL: Core-side resolution, not an SPO bolt-on).
Round-trip test pins the classid_lo <-> entity_type consistency the audit flagged:
- register_class_path(t, from_guid_prefix(g)) => class_id_for_guid(g) == Some(t);
- zero-fallback (unbound GUID -> None);
- lossy-fold refusal (high classid u16 -> None).
16 ontology tests green; registry.rs clippy-clean + fmt clean.
Board: LATEST_STATE wiring entry; EPIPHANIES E-OGAR-ONTOLOGY-WIRED-1; TECH_DEBT TD-ONTOLOGY-LINT (pre-existing crate clippy debt, present on main, un-CI-gated).
Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
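The join described above composes two lookups, and a toy model makes the three pinned properties concrete. The sketch below is an assumption-heavy simplification: `NodeGuid` is reduced to a single `u32` classid, `NiblePath` to a `u16` wrapper, and the registry to one `HashMap`; the real types in lance-graph-ontology are not shown in this PR text and will differ. Only the method names mirror the commit message.

```rust
use std::collections::HashMap;

type ClassId = u32; // assumption: stand-in for the real ClassId

/// Assumption: a GUID reduced to its classid, split as high/low u16 halves.
#[derive(Clone, Copy)]
struct NodeGuid {
    classid: u32,
}

/// Assumption: a path reduced to the low u16 of the classid.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NiblePath(u16);

impl NiblePath {
    /// Fold a GUID into a path, refusing when the fold would lose bits.
    fn from_guid_prefix(guid: &NodeGuid) -> Option<NiblePath> {
        if guid.classid >> 16 != 0 {
            return None; // lossy-fold refusal: high u16 is set
        }
        Some(NiblePath(guid.classid as u16))
    }
}

#[derive(Default)]
struct OntologyRegistry {
    by_path: HashMap<NiblePath, ClassId>,
}

impl OntologyRegistry {
    fn register_class_path(&mut self, entity_type: ClassId, path: NiblePath) {
        self.by_path.insert(path, entity_type);
    }

    fn entity_type_of(&self, path: NiblePath) -> Option<ClassId> {
        self.by_path.get(&path).copied()
    }

    /// The join: GUID -> path -> entity type, with no fallback at either hop.
    fn class_id_for_guid(&self, guid: &NodeGuid) -> Option<ClassId> {
        let path = NiblePath::from_guid_prefix(guid)?;
        self.entity_type_of(path)
    }
}
```

With this model, registering `t` under `from_guid_prefix(g)` makes `class_id_for_guid(g)` return `Some(t)`, an unregistered GUID returns `None`, and a GUID whose high half is non-zero returns `None` before the registry is consulted.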
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7834828a72
    if step == 0 {
        return None; // illegal lead
    }
    out.push(first_uni(&bytes[i..]));
Reject truncated UTF-8 sequences
step only proves that the lead byte is not illegal, but this path still calls first_uni when the slice does not contain all bytes for that character. For malformed OCR input ending in a multibyte lead such as [0xC3], first_uni consumes the one available byte, subtracts the 2-byte offset, and pushes a negative/fabricated codepoint before advancing past the end. Since this public byte-parity decoder accepts length-delimited slices, add a remaining-length check (and return None or otherwise mirror the oracle) before decoding.
utf8_to_utf32 only checked the lead byte was legal; for a truncated trailing multibyte lead (e.g. [0xC3], or a 3-byte lead with 2 bytes present) it still called first_uni on the short slice, where take(len) decodes from the partial bytes and the offset subtraction fabricates a codepoint ([0xC3] -> Some([64])).
The C++ UTF8ToUTF32 reads past its buffer here (UB on length-delimited input); this length-delimited decoder now rejects it (i + step > len -> None) instead of fabricating.
Byte-parity unchanged (268/268 vs the libtesseract oracle; the corpus has no truncated cases). +1 test (truncated_trailing_multibyte_is_rejected); docs on utf8_to_utf32 + first_uni updated. clippy --all-targets -D warnings + fmt clean.
Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
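A decode loop with the remaining-length guard might be sketched as below. This is an illustration of the `i + step > len -> None` check named in the commit, not the crate's code: `lead_step` and `utf8_to_utf32_sketch` are hypothetical names, and the offset table is derived from the UTF-8 marker bits as an assumption.

```rust
/// Hypothetical helper: sequence length announced by a lead byte, 0 if illegal.
fn lead_step(lead: u8) -> usize {
    match lead {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => 0, // continuation bytes and 0xF8..=0xFF
    }
}

/// Sketch of a length-delimited decoder that refuses truncated sequences.
fn utf8_to_utf32_sketch(bytes: &[u8]) -> Option<Vec<i32>> {
    // Sum of the marker bits per sequence length (assumption, see lead-in).
    const OFFSETS: [i32; 5] = [0, 0, 0x3080, 0xE2080, 0x3C8_2080];
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let step = lead_step(bytes[i]);
        if step == 0 {
            return None; // illegal lead
        }
        if i + step > bytes.len() {
            return None; // truncated: the lead promises more bytes than remain
        }
        let uni = bytes[i..i + step]
            .iter()
            .fold(0i32, |acc, &b| (acc << 6) + i32::from(b));
        out.push(uni - OFFSETS[step]);
        i += step;
    }
    Some(out)
}
```

With the guard in place, both `[0xC3]` and a 3-byte lead followed by a single continuation byte return `None` rather than a fabricated codepoint, while complete sequences decode as before.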
PR_ARC_INVENTORY #534 entry + LATEST_STATE "Recently Shipped PRs" row for the merged #534 (classid → ClassView resolution surface: UNICHAR adapter, keystone, OGAR→ontology wiring). Per the Mandatory Board-Hygiene Rule's post-merge step. Co-Authored-By: Claude <noreply@anthropic.com> https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
Summary

Completes the `classid → ClassView` resolution surface of the OGAR Core. After this PR, all three "classid → X" axes resolve from a canonical-node GUID:

- `classid_read_mode` (via `ocr.rs`)
- `codegen_manifest::methods_for` (the harvested `has_function` manifest)
- `OntologyRegistry::class_id_for_guid` → `RegistryClassView`

This dovetails with the just-merged #533 (`virtually_overrides` as a computed ClassView relation) and the `E-ODOO-CORE-FIRST-STRUCTURAL` direction (#530): all Core-side resolution, no new flat-ndjson predicates.

What's in it (4 commits)

1. UNICHAR UTF-8 codec: second byte-parity adapter (`lance-graph-contract::unichar`)

Transcode of Tesseract's `ccutil/unichar.cpp`: `utf8_step` (const-fn 256-entry lead-byte table) + `utf8_to_utf32`. 268/268 byte-identical to a libtesseract oracle (256 exhaustive `utf8_step` values + 12 decode rows). Proves the transcode generalizes to a 2nd class, and shows why a faithful transcode is mandatory: Tesseract maps `0xC0`/`0xC1` to step 2 and decodes overlong NUL `C0 80` → `[0]`, which `core::str::from_utf8` rejects, so a native-UTF-8 shortcut would silently diverge (pinned by `from_utf8_rejects_what_tesseract_accepts`). +8 tests.

2. UniCharSet keystone: `classid → ClassView → adapter` (`lance-graph-contract::unicharset_adapter`)

Composes the proven `UniCharSet` adapter through the OGAR Core's three movable parts (steps 2–3 of `PROBE-OGAR-ADAPTER-UNICHARSET`). `invoke_unicharset(registry, store, classid, call)`: (1) ClassView composition gate via `methods_for`, (2) classid-keyed content-store tier (`UniCharSetStore`; the adapter holds no state), (3) the proven leaf. Byte-parity is inherited; the keystone proves the dispatch path is faithful and there is no Core gap (the doctrine's iron guard holds). +5 tests. Flips the core-first doctrine to proven end-to-end.

3. OGAR → lance-graph-ontology wiring (`OntologyRegistry::class_id_for_guid`)

Closes a gap an audit this session found: `NiblePath::from_guid_prefix` (canon GUID→NiblePath fold) and the registry's `entity_type ↔ NiblePath` bijection were both built with zero callers. `class_id_for_guid(&NodeGuid) -> Option<ClassId>` lays the join (`from_guid_prefix(guid)? → entity_type_of(path)`), so a node carrying a classid resolves its ontology class → `RegistryClassView`. Round-trip test pins the `classid_lo ↔ entity_type` consistency; zero-fallback + lossy-fold refusal hold. +1 test.

4. Board post-merge hygiene for #521 (PR_ARC + LATEST_STATE)

The post-merge governance entry owed for #521 (the contract `MethodSig` + `UniCharSet` PR).

Tests / gates

- `lance-graph-contract`: 658 lib tests green; `clippy --all-targets -D warnings` clean; fmt clean.
- `lance-graph-ontology`: 16 class/bijection/wiring tests green; `registry.rs` clippy-clean + fmt clean.

Notes

- `main` already carries it.
- (`TD-ONTOLOGY-LINT`): `lance-graph-ontology` has 12 `clippy -D warnings` errors on toolchain 1.95 in other files (`owl.rs` / `op_emitter.rs` / `ttl_parse.rs`), present on `main`, un-CI-gated. Not touched here (out of scope); `registry.rs` itself is clean.

🤖 Generated with Claude Code
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1